cluster validity index
Absolute indices for determining compactness, separability and number of clusters
Bagirov, Adil M., Aliguliyev, Ramiz M., Sultanova, Nargiz, Taheri, Sona
Finding "true" clusters in a data set is a challenging problem. Clustering solutions obtained using different models and algorithms do not necessarily provide compact and well-separated clusters or the optimal number of clusters. Cluster validity indices are commonly applied to identify such clusters. Nevertheless, these indices are typically relative, and they are used to compare clustering algorithms or choose the parameters of a clustering algorithm. Moreover, the success of these indices depends on the underlying data structure. This paper introduces novel absolute cluster indices to determine both the compactness and separability of clusters. We define a compactness function for each cluster and a set of neighboring points for cluster pairs. This function is utilized to determine the compactness of each cluster and the whole cluster distribution. The set of neighboring points is used to define the margin between clusters and the overall distribution margin. The proposed compactness and separability indices are applied to identify the true number of clusters. Using a number of synthetic and real-world data sets, we demonstrate the performance of these new indices and compare them with other widely-used cluster validity indices.
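The paper's exact compactness function and neighboring-point margin are defined in the text itself; as a rough illustration of the two quantities involved, the sketch below uses simplified stand-ins (mean distance to the centroid for compactness, minimum inter-cluster point distance for the margin) — these are assumptions, not the authors' definitions:

```python
import math

def compactness(cluster):
    # simplified stand-in: mean distance of points to the cluster centroid
    d = len(cluster[0])
    centroid = [sum(p[i] for p in cluster) / len(cluster) for i in range(d)]
    return sum(math.dist(p, centroid) for p in cluster) / len(cluster)

def margin(c1, c2):
    # simplified stand-in: smallest distance between points of the two clusters
    return min(math.dist(p, q) for p in c1 for q in c2)

tight = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
far   = [(5.0, 5.0), (5.1, 5.0)]
# a compact, well-separated pair: compactness well below the margin
print(compactness(tight) < margin(tight, far))  # → True
```

An absolute index in this spirit can then compare each cluster's compactness against its margins to neighbors, rather than ranking one clustering against another.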
Improving internal cluster quality evaluation in noisy Gaussian mixtures
de Amorim, Renato Cordeiro, Makarenkov, Vladimir
Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground-truth labels are unavailable. However, these measures can be affected by the feature-relevance issue, potentially leading to unreliable evaluations in high-dimensional or noisy data sets. We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation by adjusting feature contributions based on their dispersion. It attenuates noise features, clarifies clustering compactness and separation, and thereby aligns clustering validation more closely with the ground truth. Through extensive experiments on synthetic data sets under different configurations, we demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features. The results show that FIR increases the robustness of clustering evaluation, reduces variability in performance across different data sets, and remains effective even when clusters exhibit significant overlap. These findings highlight the potential of FIR as a valuable enhancement of clustering validation, making it a practical tool for unsupervised learning tasks where labelled data is unavailable. (Mila - Quebec AI Institute, Montreal, QC, Canada.) Keywords: cluster validity indices, data rescaling, noisy data.
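The abstract describes rescaling feature contributions by their dispersion but does not give the formula; the following is a hedged sketch of that idea only (inverse within-cluster dispersion as a feature weight), not the authors' exact FIR method:

```python
import math

def fir_rescale(X, labels):
    # illustrative sketch, not the paper's exact formula: weight each feature
    # by the inverse of its total within-cluster dispersion, so that noisy
    # (high-dispersion) features are attenuated before computing an index
    d = len(X[0])
    disp = [0.0] * d
    for k in set(labels):
        pts = [x for x, l in zip(X, labels) if l == k]
        centroid = [sum(p[j] for p in pts) / len(pts) for j in range(d)]
        for p in pts:
            for j in range(d):
                disp[j] += (p[j] - centroid[j]) ** 2
    w = [1.0 / (s + 1e-12) for s in disp]
    total = sum(w)
    w = [x / total for x in w]  # normalise weights to sum to 1
    rescaled = [[x[j] * math.sqrt(w[j]) for j in range(d)] for x in X]
    return rescaled, w

# feature 0 separates the clusters, feature 1 is pure noise
X = [[0.0, 3.0], [0.2, -2.0], [5.0, 2.5], [5.2, -3.0]]
labels = [0, 0, 1, 1]
Xr, w = fir_rescale(X, labels)
print(w[0] > w[1])  # → True: the informative feature gets more weight
```

Any standard internal index computed on the rescaled data would then be less distorted by the noise feature, which is the effect the paper measures.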
From A-to-Z Review of Clustering Validation Indices
Hassan, Bryar A., Tayfor, Noor Bahjat, Hassan, Alla A., Ahmed, Aram M., Rashid, Tarik A., Abdalla, Naz N.
Data clustering involves identifying latent similarities within a dataset and organizing them into clusters or groups. The outcomes of various clustering algorithms differ as they are susceptible to the intrinsic characteristics of the original dataset, including noise and dimensionality. The effectiveness of such clustering procedures directly impacts the homogeneity of clusters, underscoring the significance of evaluating algorithmic outcomes. Consequently, the assessment of clustering quality presents a significant and complex endeavor. A pivotal aspect affecting clustering validation is the cluster validity metric, which aids in determining the optimal number of clusters. The main goal of this study is to comprehensively review and explain the mathematical operation of a broad, though not exhaustive, set of internal and external cluster validity indices, to categorize these indices, and to offer suggestions for future advancement of clustering validation research. In addition, we review and evaluate the performance of internal and external clustering validation indices on the most common clustering algorithms, such as the evolutionary clustering algorithm star (ECA*). Finally, we suggest a classification framework for examining the functionality of both internal and external clustering validation measures regarding their ideal values, user-friendliness, responsiveness to input data, and appropriateness across various fields. This classification aids researchers in selecting the appropriate clustering validation measure to suit their specific requirements.
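The internal/external distinction the review is organized around can be made concrete with one minimal external measure. The Rand index below is the standard textbook definition (fraction of point pairs on which two labelings agree), not a construct specific to this survey:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    # external validation: fraction of point pairs on which the two
    # labelings agree about same-cluster vs different-cluster membership
    agree = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
    return agree / len(pairs)

truth = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]  # the same partition under different label names
print(rand_index(truth, perfect))  # → 1.0
```

Internal indices (Silhouette, Davies-Bouldin, and the others surveyed) replace `labels_true` with geometric structure computed from the data itself, which is exactly why their evaluation is harder.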
A new validity measure for fuzzy c-means clustering
A new cluster validity index is proposed for fuzzy clusters obtained from the fuzzy c-means algorithm. The proposed validity index exploits inter-cluster proximity between fuzzy clusters. Inter-cluster proximity is used to measure the degree of overlap between clusters. A low proximity value refers to well-partitioned clusters. The best fuzzy c-partition is obtained by minimizing inter-cluster proximity with respect to c. Well-known data sets are tested to show the effectiveness and reliability of the proposed index.
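The abstract does not give the proximity formula, so the sketch below uses a made-up but natural overlap measure on fuzzy membership vectors (average of the pointwise minimum of two clusters' memberships) purely to convey the intuition that low proximity means well-partitioned clusters:

```python
def proximity(u_a, u_b):
    # illustrative overlap between two fuzzy clusters: for each point take
    # the smaller of its two memberships, then average over points
    # (an assumption for illustration, not the paper's exact definition)
    return sum(min(a, b) for a, b in zip(u_a, u_b)) / len(u_a)

# memberships of 4 points in two clusters from a fuzzy c-means run (made up)
u1 = [0.9, 0.8, 0.2, 0.1]
u2 = [0.1, 0.2, 0.8, 0.9]
overlapping = [0.5, 0.5, 0.5, 0.5]
print(proximity(u1, u2))           # low value: well-separated partition
print(proximity(u1, overlapping))  # higher value: strong overlap
```

Minimizing such a proximity over candidate values of c then selects the partition whose clusters overlap least, which is the selection rule the abstract describes.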
A Bayesian cluster validity index
Wiroonsri, Nathakhun, Preedasawakul, Onthada
Selecting the number of clusters is one of the key processes when applying clustering algorithms. To fulfill this task, various cluster validity indices (CVIs) have been introduced. Most cluster validity indices are defined to detect the optimal number of clusters hidden in a dataset. However, users sometimes do not want the optimal number of groups but a secondary one that is more reasonable for their applications. This has motivated us to introduce a Bayesian cluster validity index (BCVI) based on existing underlying indices. This index is defined based on either Dirichlet or Generalized Dirichlet priors, which result in the same posterior distribution. Our BCVI is then tested using the Wiroonsri index (WI) and the Wiroonsri-Preedasawakul index (WP) as underlying indices for hard and soft clustering, respectively. We compare their outcomes with the original underlying indices, as well as several more existing CVIs, including the Davies-Bouldin (DB), Starczewski (STR), Xie-Beni (XB), and KWON2 indices. Our proposed BCVI is clearly beneficial in applications where user experience matters, as it lets users specify their expected range for the final number of clusters. This aspect is emphasized by our experiments, which are classified into three different cases. Finally, we present some applications to real-world datasets, including MRI brain tumor images. Our tools will be added to a new version of the recently developed R package ``UniversalCVI''.
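The Bayesian idea — blending an underlying index's evidence with a user-chosen Dirichlet prior over candidate numbers of clusters — can be sketched as below. This is a hedged illustration only: the normalisation, the pseudo-count `n`, and the alpha values are assumptions, not the authors' exact posterior:

```python
def bcvi_scores(index_values, alpha, n=10):
    # hedged sketch: treat normalised index values like observed proportions
    # and combine them with Dirichlet prior parameters alpha; the posterior
    # mean of a Dirichlet-multinomial model is proportional to alpha_k + n*r_k
    total = sum(index_values)
    r = [v / total for v in index_values]
    post = [a + n * rk for a, rk in zip(alpha, r)]
    s = sum(post)
    return [p / s for p in post]

# underlying CVI values for k = 2..6 (made-up numbers): k = 4 looks best
cvi = [0.2, 0.5, 0.9, 0.4, 0.1]
flat_prior = [1, 1, 1, 1, 1]    # no preference over k
prefer_small = [8, 4, 1, 1, 1]  # user expects few clusters
print(max(range(5), key=lambda i: bcvi_scores(cvi, flat_prior)[i]) + 2)    # → 4
print(max(range(5), key=lambda i: bcvi_scores(cvi, prefer_small)[i]) + 2)  # → 2
```

With a flat prior the data-driven optimum wins; with a prior concentrated on small k, the user's expectation pulls the selection toward k = 2, which is the "secondary option" behaviour the abstract emphasizes.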
A correlation-based fuzzy cluster validity index with secondary options detector
Wiroonsri, Nathakhun, Preedasawakul, Onthada
The optimal number of clusters is one of the main concerns when applying cluster analysis. Several cluster validity indexes have been introduced to address this problem. However, in some situations, there is more than one option that can be chosen as the final number of clusters. This aspect has been overlooked by most of the existing works in this area. In this study, we introduce a correlation-based fuzzy cluster validity index known as the Wiroonsri-Preedasawakul (WP) index. This index is defined based on the correlation between the actual distance between a pair of data points and the distance between adjusted centroids with respect to that pair. We evaluate and compare the performance of our index with several existing indexes, including Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2. We conduct this evaluation on four types of datasets: artificial datasets, real-world datasets, simulated datasets with ranks, and image datasets, using the fuzzy c-means algorithm. Overall, the WP index outperforms most, if not all, of these indexes in terms of accurately detecting the optimal number of clusters and providing accurate secondary options. Moreover, our index remains effective even when the fuzziness parameter $m$ is set to a large value. Our R package called UniversalCVI used in this work is available at https://CRAN.R-project.org/package=UniversalCVI.
Are Cluster Validity Measures (In)valid?
Gagolewski, Marek, Bartoszuk, Maciej, Cena, Anna
Internal cluster validity measures (such as the Calinski-Harabasz, Dunn, or Davies-Bouldin indices) are frequently used for selecting the appropriate number of partitions a dataset should be split into. In this paper we consider what happens if we treat such indices as objective functions in unsupervised learning activities. Is the optimal grouping with regards to, say, the Silhouette index really meaningful? It turns out that many cluster (in)validity indices promote clusterings that match expert knowledge quite poorly. We also introduce a new, well-performing variant of the Dunn index that is built upon OWA operators and the near-neighbour graph so that subspaces of higher density, regardless of their shapes, can be separated from each other better.
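The paper's new variant builds on the classic Dunn index, whose standard definition (smallest between-cluster distance over largest within-cluster diameter) is shown below; the OWA-operator and near-neighbour-graph refinements are the paper's contribution and are not reproduced here:

```python
import math
from itertools import combinations

def dunn(clusters):
    # classic Dunn index: smallest between-cluster point distance divided by
    # the largest within-cluster diameter (the paper's variant replaces these
    # hard min/max aggregations with OWA operators on a near-neighbour graph)
    inter = min(math.dist(p, q)
                for a, b in combinations(clusters, 2) for p in a for q in b)
    intra = max(math.dist(p, q)
                for c in clusters for p, q in combinations(c, 2))
    return inter / intra

compact = [[(0, 0), (0, 1)], [(9, 9), (9, 10)]]
loose   = [[(0, 0), (0, 5)], [(5, 5), (9, 10)]]
print(dunn(compact) > dunn(loose))  # → True: higher Dunn favours the compact split
```

Because the classic index depends on single extreme distances, one outlier can dominate it; averaging aggregations such as OWA operators address exactly that sensitivity.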
Clustering performance analysis using new correlation based cluster validity indices
There are various cluster validity measures used for evaluating clustering results. One of the main objectives of using these measures is to seek the unknown optimal number of clusters. Some measures work well for clusters with different densities, sizes, and shapes. Yet one weakness those validity measures share is that they sometimes provide only one clear optimal number of clusters. That number is actually unknown, and there may be more than one potential sub-optimal option that a user may wish to choose depending on the application. We develop two new cluster validity indices based on the correlation between the actual distance between a pair of data points and the centroid distance of the clusters in which the two points are located. Our proposed indices consistently yield several peaks at different numbers of clusters, which overcomes the weakness stated above. Furthermore, the introduced correlation can also be used for evaluating the quality of a selected clustering result. Several experiments in different scenarios, including the well-known iris data set and a real-world marketing application, have been conducted to compare the proposed validity indices with several well-known ones.
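The correlation at the heart of these indices (and of the related WP index above) can be sketched for hard clusterings as follows. This is a simplified assumption-laden illustration: plain centroids and plain Pearson correlation, without the adjustments the papers define:

```python
import math
from itertools import combinations

def pearson(x, y):
    # Pearson correlation coefficient of two equal-length sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_score(X, labels):
    # sketch of the correlation idea: compare each pair's actual distance
    # with the distance between the centroids of the clusters the two points
    # belong to (plain centroids here; the papers use adjusted centroids)
    cent = {}
    for k in set(labels):
        pts = [x for x, l in zip(X, labels) if l == k]
        cent[k] = [sum(p[j] for p in pts) / len(pts) for j in range(len(pts[0]))]
    actual, model = [], []
    for i, j in combinations(range(len(X)), 2):
        actual.append(math.dist(X[i], X[j]))
        model.append(math.dist(cent[labels[i]], cent[labels[j]]))
    return pearson(actual, model)

X = [[0, 0], [0.3, 0.1], [5, 5], [5.2, 4.9], [10, 0], [9.8, 0.2]]
good = [0, 0, 1, 1, 2, 2]
bad  = [0, 1, 0, 1, 2, 2]
print(correlation_score(X, good) > correlation_score(X, bad))  # → True
```

A good partition makes centroid distances track actual pairwise distances closely, so the correlation is high; scanning this score over candidate numbers of clusters produces the multiple peaks the abstract describes.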
A Simplified Framework for Air Route Clustering Based on ADS-B Data
Duong, Quan, Tran, Tan, Pham, Duc-Thinh, Mai, An
The volume of flight traffic increases over time, which makes strategic traffic flow management a challenging problem, since modeling the entire traffic data requires substantial computational resources. On the other hand, Automatic Dependent Surveillance - Broadcast (ADS-B) technology has been considered a promising data technology for safely and efficiently providing both flight crews and ground control staff with the necessary information about the position and velocity of airplanes in a specific area. To tackle this problem, we present in this paper a simplified framework that can help detect the typical air routes between airports based on ADS-B data. Specifically, the flight traffic is classified into major groups based on similarity measures, which helps to reduce the number of flight paths between airports. Our framework can thus be used to reduce the practical computational cost of air flow optimization and to evaluate operational performance. Finally, to illustrate the potential applications of our proposed framework, an experiment was performed using ADS-B traffic flight data for three different pairs of airports. The typical routes detected between each pair of airports show promising results, obtained by combining two indices for measuring clustering performance and incorporating human judgment through visual inspection.
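The grouping step — collapsing many flight paths between an airport pair into a few typical routes by similarity — can be sketched with a toy greedy scheme. The distance function (average pointwise distance of equally-sampled trajectories) and the threshold grouping are illustrative assumptions, not the paper's pipeline:

```python
import math

def path_distance(p, q):
    # average pointwise distance between two equally-sampled trajectories
    return sum(math.dist(a, b) for a, b in zip(p, q)) / len(p)

def group_routes(paths, threshold):
    # greedy illustration: assign each flight path to the first existing
    # group whose representative is close enough, else open a new group
    # (the paper instead combines similarity measures with validity indices
    # and visual inspection to pick the grouping)
    reps, groups = [], []
    for p in paths:
        for i, r in enumerate(reps):
            if path_distance(p, r) <= threshold:
                groups[i].append(p)
                break
        else:
            reps.append(p)
            groups.append([p])
    return groups

# three synthetic "flights" between the same airports; two share a route
direct_a = [(0, 0), (5, 0.2), (10, 0)]
direct_b = [(0, 0), (5, -0.1), (10, 0)]
detour   = [(0, 0), (5, 4.0), (10, 0)]
print(len(group_routes([direct_a, direct_b, detour], threshold=1.0)))  # → 2
```

Each group's representative then serves as one typical route, which is what lets downstream flow-optimization work with a handful of routes instead of every individual flight path.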
An Internal Cluster Validity Index Based on Distance-based Separability Measure
Evaluating clustering results is a significant part of cluster analysis. Since clustering is a typical unsupervised learning task, there are usually no true class labels, so a number of internal evaluation measures, which use only predicted labels and the data, have been created. They are also called internal cluster validity indices (CVIs). Without true labels, designing an effective CVI is not simple, because it is as hard as creating a clustering method. Having more CVIs is crucial, because there is no universal CVI that can measure all datasets and no specific method for selecting a proper CVI for clusters without true labels. Therefore, applying more CVIs to evaluate clustering results is necessary. In this paper, we propose a novel CVI, called the Distance-based Separability Index (DSI), based on a data separability measure. We applied the DSI and eight other internal CVIs, from early studies such as Dunn (1974) to the most recent such as CVDD (2019), as a comparison. We used an external CVI as ground truth for the clustering results of five clustering algorithms on 12 real and 97 synthetic datasets. The results show that the DSI is an effective, unique, and competitive CVI compared with the other CVIs. In addition, we summarize the general process for evaluating CVIs and create a new method, rank difference, to compare the results of CVIs.
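One natural way to make a "distance-based separability measure" concrete is to compare the distribution of within-cluster distances against between-cluster distances; the sketch below does this with a two-sample Kolmogorov-Smirnov statistic. This is a hedged illustration of the general idea, not necessarily the paper's exact DSI:

```python
import math
from itertools import combinations

def ks_statistic(xs, ys):
    # two-sample Kolmogorov-Smirnov statistic: largest gap between the
    # empirical CDFs of the two samples
    def cdf(sample, v):
        return sum(1 for t in sample if t <= v) / len(sample)
    grid = sorted(set(xs) | set(ys))
    return max(abs(cdf(xs, v) - cdf(ys, v)) for v in grid)

def separability(data, labels):
    # illustrative separability score: well-separated clusters make the
    # within-cluster and between-cluster distance distributions differ
    # sharply, driving the KS statistic toward 1
    within, between = [], []
    for i, j in combinations(range(len(data)), 2):
        d = math.dist(data[i], data[j])
        (within if labels[i] == labels[j] else between).append(d)
    return ks_statistic(within, between)

data = [(0, 0), (0.2, 0.1), (0.1, 0.3), (6, 6), (6.1, 5.8), (5.9, 6.2)]
good  = [0, 0, 0, 1, 1, 1]
mixed = [0, 1, 0, 1, 0, 1]
print(separability(data, good) > separability(data, mixed))  # → True
```

Unlike centroid-based indices, a distribution comparison of this kind does not assume any particular cluster shape, which is one reason a separability-based CVI can behave differently from the eight classical indices compared in the paper.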